Edge Pruning



PAHQ: Accelerating Automated Circuit Discovery through Mixed-Precision Inference Optimization

Wang, Xinhai, Yang, Shu, Wang, Liangyu, Zhang, Lin, Xie, Huanyi, Hu, Lijie, Wang, Di

arXiv.org Artificial Intelligence

Circuit discovery, which involves identifying sparse and task-relevant subnetworks in pre-trained language models, is a cornerstone of mechanistic interpretability. Automated Circuit Discovery (ACDC) has emerged as a pivotal methodology in circuit discovery, but its application to large language models is severely limited by computational inefficiency and prohibitively high memory requirements. Although several accelerated approaches have been proposed, they primarily rely on linear approximations to ACDC, which significantly compromises analytical faithfulness. Our proposed method for accelerating automated circuit discovery, Per Attention Head Quantization (PAHQ), takes a fundamentally different approach by optimizing the efficiency of each individual patching operation. PAHQ leverages a fundamental alignment between activation patching and mixed-precision quantization (MPQ): interpretability analysis through patching essentially performs targeted ablation studies. Therefore, we can maintain high precision exclusively for investigated components while safely reducing precision elsewhere in the network. PAHQ-accelerated ACDC reduces runtime by up to 80% and memory consumption by up to 30% compared to unaccelerated ACDC while maintaining faithfulness. Importantly, our method readily integrates with existing edge-based circuit discovery techniques by modifying the attention computation mechanism. This training-free approach provides a practical and novel pathway for accelerating mechanistic interpretability methods. Our code is available at https://github.com/626619403/PAHQ.
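To make the core idea concrete, here is a minimal sketch of per-head mixed precision during patching. Everything below is illustrative: the function names, the symmetric int8 fake quantization, and the weight layout are our assumptions, not PAHQ's actual implementation.

```python
# Illustrative sketch of per-attention-head mixed precision during activation
# patching, in the spirit of PAHQ. Names and weight layout are assumptions.
import torch

def fake_quantize_int8(w: torch.Tensor) -> torch.Tensor:
    """Symmetric int8 fake quantization: quantize, then dequantize."""
    scale = w.abs().max() / 127.0
    if scale == 0:
        return w
    return torch.round(w / scale).clamp(-128, 127) * scale

def mixed_precision_heads(w_q: torch.Tensor, n_heads: int, target_head: int) -> torch.Tensor:
    """Keep the investigated head's slice of a (d_model, d_model) projection
    in full precision; fake-quantize every other head's slice."""
    d_head = w_q.shape[1] // n_heads
    w_out = w_q.clone()
    for h in range(n_heads):
        if h == target_head:
            continue  # the head being patched stays high-precision
        cols = slice(h * d_head, (h + 1) * d_head)
        w_out[:, cols] = fake_quantize_int8(w_out[:, cols])
    return w_out

# Toy usage: 8 heads, d_model = 64; head 3 is under investigation.
w = torch.randn(64, 64)
w_mixed = mixed_precision_heads(w, n_heads=8, target_head=3)
print((w - w_mixed).abs().max())  # quantization error on non-target heads only
```

In a real run the low-precision slices would be stored as actual int8 tensors to realize the memory savings; the fake quantization above only mimics the numerical effect.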



BlackboxNLP-2025 MIB Shared Task: Exploring Ensemble Strategies for Circuit Localization Methods

Mondorf, Philipp, Wang, Mingyang, Gerstner, Sebastian, Hakimi, Ahmad Dawar, Liu, Yihong, Veloso, Leonor, Zhou, Shijia, Schütze, Hinrich, Plank, Barbara

arXiv.org Artificial Intelligence

The Circuit Localization track of the Mechanistic Interpretability Benchmark (MIB) evaluates methods for localizing circuits within large language models (LLMs), i.e., subnetworks responsible for specific task behaviors. In this work, we investigate whether ensembling two or more circuit localization methods can improve performance. We explore two variants: parallel and sequential ensembling. In parallel ensembling, we combine the attribution scores assigned to each edge by different methods, e.g., by averaging or taking the minimum or maximum value. In the sequential ensemble, we use edge attribution scores obtained via EAP-IG as a warm start for a more expensive but more precise circuit identification method, namely Edge Pruning. We observe that both approaches yield notable gains on the benchmark metrics, leading to a more precise circuit identification approach. Finally, we find that taking a parallel ensemble over various methods, including the sequential ensemble, achieves the best results. We evaluate our approach in the BlackboxNLP 2025 MIB Shared Task, comparing ensemble scores to official baselines across multiple model-task combinations.
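As a concrete illustration of the parallel variant, the sketch below combines per-edge attribution scores from two hypothetical methods. The edge names and score values are invented for the example; the combination rules (mean, min, max) follow the abstract.

```python
# Sketch of parallel ensembling over edge attribution scores. The two "method"
# dictionaries and the edge names are invented for illustration.
from statistics import mean

def parallel_ensemble(score_dicts, combine=mean):
    """Combine per-edge attribution scores edge by edge (mean/min/max/...)."""
    edges = set().union(*score_dicts)
    return {e: combine([s.get(e, 0.0) for s in score_dicts]) for e in edges}

eap_ig    = {("attn.0.h3", "mlp.1"): 0.9, ("mlp.1", "logits"): 0.4}
act_patch = {("attn.0.h3", "mlp.1"): 0.7, ("mlp.1", "logits"): 0.6}

print(parallel_ensemble([eap_ig, act_patch]))               # averaging
print(parallel_ensemble([eap_ig, act_patch], combine=max))  # maximum value
```

The sequential variant described in the abstract would instead feed one method's scores (here, the EAP-IG dictionary) in as the initialization for Edge Pruning's edge masks rather than combining them directly.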





Finding Transformer Circuits With Edge Pruning

Neural Information Processing Systems

The path to interpreting a language model often proceeds via analysis of circuits---sparse computational subgraphs of the model that capture specific aspects of its behavior. Recent work has automated the task of discovering circuits. Yet, these methods have practical limitations, as they rely either on inefficient search algorithms or on inaccurate approximations. In this paper, we frame circuit discovery as an optimization problem and propose Edge Pruning as an effective and scalable solution. Our method finds circuits in GPT-2 that use fewer than half as many edges as circuits found by previous methods, while being equally faithful to the full model's predictions on standard circuit-finding tasks.
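The abstract frames circuit discovery as an optimization problem; one common way to write such an objective (notation ours, not necessarily the paper's exact formulation) is to learn a gate per edge that keeps the circuit faithful to the full model under a sparsity budget:

```latex
% Circuit discovery as constrained optimization (notation ours, for
% illustration): z gates the E edges of the model's computational graph.
\min_{z \in [0,1]^{E}} \;
  \mathbb{E}_{x \sim \mathcal{D}}
  \left[ D_{\mathrm{KL}}\!\left( p_{\mathrm{full}}(\cdot \mid x)
  \;\middle\|\; p_{\mathrm{circuit}}(\cdot \mid x; z) \right) \right]
  \quad \text{s.t.} \quad \lVert z \rVert_{0} \le k
```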


Spectral Theory for Edge Pruning in Asynchronous Recurrent Graph Neural Networks

Bessone, Nicolas

arXiv.org Artificial Intelligence

Graph Neural Networks (GNNs) have emerged as a powerful tool for learning on graph-structured data, finding applications in numerous domains including social network analysis and molecular biology. Within this broad category, Asynchronous Recurrent Graph Neural Networks (ARGNNs) stand out for their ability to capture complex dependencies in dynamic graphs, resembling living organisms' intricate and adaptive nature. However, their complexity often leads to large and computationally expensive models. Therefore, pruning unnecessary edges becomes crucial for enhancing efficiency without significantly compromising performance. This paper presents a dynamic pruning method based on graph spectral theory, leveraging the imaginary component of the eigenvalues of the network graph's Laplacian.
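The abstract's criterion, the imaginary component of the Laplacian spectrum, can be prototyped in a few lines. The sketch below scores each directed edge by how much its removal perturbs the total imaginary spectral mass and prunes the least impactful edges; this is one plausible reading of the idea, not the paper's exact rule.

```python
# Illustrative sketch only: score directed edges by how much their removal
# perturbs the imaginary part of the graph Laplacian's spectrum, then prune
# the lowest-impact edges. Our reading of the idea, not the paper's criterion.
import numpy as np

def laplacian(adj: np.ndarray) -> np.ndarray:
    return np.diag(adj.sum(axis=1)) - adj  # out-degree Laplacian

def imag_mass(adj: np.ndarray) -> float:
    return float(np.abs(np.linalg.eigvals(laplacian(adj)).imag).sum())

def prune_edges(adj: np.ndarray, n_prune: int) -> np.ndarray:
    adj = adj.copy()
    base = imag_mass(adj)
    scores = {}
    for i, j in zip(*np.nonzero(adj)):
        trial = adj.copy()
        trial[i, j] = 0.0
        scores[(i, j)] = abs(base - imag_mass(trial))  # spectral impact
    for (i, j), _ in sorted(scores.items(), key=lambda kv: kv[1])[:n_prune]:
        adj[i, j] = 0.0  # drop the edges that matter least to the spectrum
    return adj

# Toy directed graph: a 4-cycle plus one chord.
A = np.zeros((4, 4))
for i, j in [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]:
    A[i, j] = 1.0
print(prune_edges(A, n_prune=1))
```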


Finding Transformer Circuits with Edge Pruning

Bhaskar, Adithya, Wettig, Alexander, Friedman, Dan, Chen, Danqi

arXiv.org Artificial Intelligence

The path to interpreting a language model often proceeds via analysis of circuits -- sparse computational subgraphs of the model that capture specific aspects of its behavior. Recent work has automated the task of discovering circuits. Yet, these methods have practical limitations, as they rely either on inefficient search algorithms or on inaccurate approximations. In this paper, we frame automated circuit discovery as an optimization problem and propose *Edge Pruning* as an effective and scalable solution. Edge Pruning leverages gradient-based pruning techniques, but instead of removing neurons or components, it prunes the *edges* between components. Our method finds circuits in GPT-2 that use fewer than half as many edges as circuits found by previous methods, while being equally faithful to the full model's predictions on standard circuit-finding tasks. Edge Pruning is efficient even with as many as 100K examples, outperforming previous methods in speed and producing substantially better circuits. It also perfectly recovers the ground-truth circuits in two models compiled with Tracr. Thanks to its efficiency, we scale Edge Pruning to CodeLlama-13B, a model over 100x larger than those prior methods operate on. We use this setting for a case study comparing the mechanisms behind instruction prompting and in-context learning. We find two circuits with more than 99.96% sparsity that match the performance of the full model and reveal that the mechanisms in the two settings overlap substantially. Our case study shows that Edge Pruning is a practical and scalable tool for interpretability and sheds light on behaviors that only emerge in large models.
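Since this is the fullest statement of the method, a small sketch may help: place a learnable gate on every edge, optimize the gates for faithfulness (KL divergence to the full model) plus a sparsity penalty, then threshold. The sigmoid relaxation and the toy per-edge contributions below stand in for the paper's hard-concrete gating and real model internals; every name here is illustrative.

```python
# Minimal sketch of the edge-pruning idea: a learnable gate per edge,
# optimized for faithfulness plus sparsity. Simplified for illustration;
# not the authors' implementation.
import torch

class EdgeGates(torch.nn.Module):
    def __init__(self, n_edges: int):
        super().__init__()
        self.logits = torch.nn.Parameter(torch.zeros(n_edges))

    def forward(self):
        return torch.sigmoid(self.logits)  # soft gate in [0, 1] per edge

def objective(full_logits, circuit_logits, gates, lam=1e-2):
    # Faithfulness: KL(full || circuit); sparsity: expected open-edge count.
    kl = torch.nn.functional.kl_div(
        circuit_logits.log_softmax(-1),
        full_logits.softmax(-1),
        reduction="batchmean",
    )
    return kl + lam * gates.sum()

# Toy usage: 10 edges; pretend circuit logits are gated edge contributions.
gates_mod = EdgeGates(10)
opt = torch.optim.Adam(gates_mod.parameters(), lr=0.1)
full = torch.randn(4, 5)              # stand-in full-model logits
edge_contrib = torch.randn(10, 4, 5)  # stand-in per-edge contributions
for _ in range(100):
    g = gates_mod()
    circuit = (g[:, None, None] * edge_contrib).sum(0)
    loss_val = objective(full, circuit, g)
    opt.zero_grad()
    loss_val.backward()
    opt.step()
print((gates_mod().detach() > 0.5).sum().item(), "edges kept")
```

In practice one would anneal the gates toward hard 0/1 values and compute the circuit's logits by actually masking edges in the transformer's computational graph rather than summing synthetic contributions.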